5 Analyzing Core Diversity

Amanda Birmingham, CCBB, UCSD (abirmingham@ucsd.edu)

Table of Contents

Related Notebooks:

  • 1 Introducing 16S Microbiome Primary Analysis
  • 2 Setting Up Starcluster for QIIME
  • 3 Validation, Demultiplexing, and Quality Control
  • 4 OTU Picking and Rarefaction Depth Selection

Introducing Core Diversity Analyses

Once you have the OTU table for your data, there are a huge number of potential analyses that could be performed, depending on the hypotheses of your study. A good place to start, though, is with QIIME's core_diversity_analyses.py script, which runs a number of standard analyses of microbial diversity that provide a spring-board for deciding which study-specific additional analyses are merited.

The script produces a truly bewildering volume of output, as summarized in the output index.html and shown in this example:

The outputs fall into five main categories, some of which will be discussed further below:

  • Run summary data
  • Group/category significance results (for each category specified by the user at run time)
  • Taxonomic summary results (for all samples and for samples grouped by each category specified by the user at run time)
  • Alpha diversity results
  • Beta diversity results

Table of Contents

Running Core Diversity Analyses

Generating all this output takes some time, so this script is again parallelizable. The command has the format:

core_diversity_analyses.py -a -O [number of CPUs] -i [sequences file path] -m [mapping file path] -i [biom table] -t [phylogenetic tree file] -e [sequence depth] -c [comma-separated column names of chosen categories] -o [output directory path]  

as shown in this example:

core_diversity_analyses.py -a -O 23 -m /data/mapping_validation/925_merged_prep_mapping2_corrected.txt -i /data/open_ref_output/otu_table_mc2_w_tax_no_pynast_failures.biom -t /data/open_ref_output/rep_set.tre -e 10000 -c RoundedTemperature,site,Subsite -o /data/core_div_output/   

Note two caveats from my experience:

  • Although some of the calculations of this step are parallelized, some (specifically the generation of box-plots) are apparently not, and can be quite time-consuming depending on which and how many categories you select. I have seen this take overnight for long lists of categories.
  • It appears that the command is not robust to running with "categories" with a lot of distinct levels (like, say, continuous variables with 30+ measured values). In this case I have seen the command run a long time (e.g., 24 hrs) and then silently complete without generating the alpha diversity results.

The take-home here is that it is not advisable to just try to run the core diversity analyses for every meta-data category that might possibly be of interest; it is necessary to be a bit choosy.

Table of Contents

Interpreting Taxa Summaries

QIIME generates both bar and area plots showing the percentages of each microbe type in each sample or group. The output taxa_plots folder contains these for ungrouped samples, while the taxa_plots_[column name] folders contain them for samples grouped by the values of the metadata column in question. They are best viewed through the HTML interface; go to the "taxa_summary_plots" sub-folder in the taxa plots folder of interest and open either the area_charts.html or bar_charts.html page, depending on which you want to look at. I find the bar charts more useful, since the bar format reflects the fact that each sample and/or group is an independent and finite population, whereas the area charts seem to imply a progression between them.

The summary charts are available at six "levels", starting with L2 (phylum) and getting more granular to L6 ("species"--or rather, OTU). Frankly, they get pretty unreadable to me at more granular levels since there are usually so many different populations being represented in each bar (or slice of the area graph). Sometimes, though, especially at more general levels, they help with visualization. For example, this plot of L2-level microbial populations in Yellowstone hot springs (courtesy of Dr. Greg Caporaso) separated by temperature bin shows a marked increase in Aquificae-phylum bacteria (represented in brown) at high temperatures:

Note that unfortunately QIIME has a bug in which it does not properly output the level label at the top of the graph (see above: "Current Level:" is empty). However, looking at the legend makes it clear which level each graph shows.

Table of Contents

Understanding Alpha Diversity

Alpha diversity metrics are measurements of within-sample diversity. The output folder named arare_max[sequencedepth] (as in arare_max10000) contains the alpha diversity analyses for the dataset at the stated sequence depth (set in the call to core_diversity_analysis.py). By default, three different alpha diversity metrics are investigated:

  • observed OTUs: just what it says on the tin--the absolute number of OTUs observed in a particular sample/group/etc.
  • PD whole tree: "phylogenetic distance", a metric that adds up the branch length of the whole phylogenetic tree (as the name implies) back to the root of all reads in the sample/group/etc. Note that the trees used for this are NOT "ultrametric" (apparently, this is the technical term for a tree in which "the distances from the root to every branch tip are equal" (http://en.wikipedia.org/wiki/Distance_matrices_in_phylogeny ). What this implies is that PD_whole_tree will be larger if either a) the subtree for a given sample has more branches in it--is "bushier"--OR has some branches in it that have evolved FURTHER AWAY from the root than non-sample branches--are "longer". So its measuring within-sample diversity in terms of how far apart the species are in terms of evolution as well as how many of them there are. This is the metric that the Knight group prefers.
  • chao1: (pronounced "chow 1"--the developer's name, perhaps?) is a metric that attempts to correct for the species that one does not see due to the fact that each sample is only a finite subset of the environment it is taken from ... apparently it does some sort of calculation based on the very low-level (one, two, etc, counts) reads to guess how many should be there that you didn't get even one read from. The Knight lab doesn't favor it, but says reviewers often ask for it.

For each of these metrics, QIIME generates (among other things) rarefaction curves that show what value of the metric you would get for each sample/group/etc if you had only used various fractions of the actual chosen sequence depth. These are stored in the alpha rarefaction plots subfolder and are best accessed through the rarefaction_plots.html interface. For example, this shows a steady increase in observed OTUs for most samples (some more so than others) at increasing fractions of the chosen sample depth:

As we can see, this doesn't really level off, even at a sequencing depth of 10,000 reads per sample. However, Greg Caporaso points out that sequencing noise often shows up as new OTUs so rarefaction curves from NGS data often never level off. He states that he doesn't use alpha rarefaction plots very often; when they are used, it is mostly to gauge relative diversity across samples (as shown here).

QIIME also uses the alpha diversity metrics at the full specified sequencing depth (in this example, 10,000 reads) to generate boxplots showing the distribution of the metric across the samples in the input groups/categories. For example, here is a boxplot (again from Caporaso's Yellowstone data) showing one site (n=24) that had very high numbers of observed OTUs--that is to say, very high internal diversity in its microbial community. (It turns out this wasn't a surprise: that particular hot-spring site had been contaminated with buffalo feces, and feces have very high microbial divesity):

This example also demonstrates a limitation of QIIME: it appears to have trouble correctly scaling the long labels on the x-axis for its box-plots, which can make interpretation something of an archaeology exercise :(

Table of Contents

Understanding Beta Diversity

In most data sets, beta diversity is where the analyses are likely to get interesting, as beta diversity metrics measure the differences in microbial community composition BETWEEN different samples--such as those from different environments/treatments/genders/etc. The output folder named bdiv_even[sequencedepth] (as in bdiv_even10000) contains the beta diversity analyses for the dataset at the stated sequence depth (set in the call to core_diversity_analysis.py).

By default, two different alpha diversity metrics are investigated: unweighted and weighted UniFrac. UniFrac is a phylogenetically based metric that essentially looks at how the phylogenetic tree of observed sequences is divided (or not) by sample. In the graphical description below (courtesy of Dr. Rob Knight), blocks represent sequences; the color(s) on each block represent the sample(s) in which that sequence was detected:

UniFrac adds up the percentage of phylogenetic branch length in the tree that is NOT shared between two samples. Weighted UniFrac is much the same but the results are weighted by the abundance of the sequences instead of being based solely on whether or not they are present. My understanding is that unweighted and weighted measures are each useful in some circumstances, so QIIME provides them both. They are the two default beta diversity metrics calculated by core_diversity_analyses.py. Specifically, QIIME calculates each of these metrics between each pair of samples, producing one square (but symmetric along the diagonal) matrix of distances between samples for each of the two metrics. These distance matrices are in the unweighted_unifrac_dm.txt and weighted_unifrac_dm.txt output files.

The values in these distance matrices are used to produce boxplots of the distribution of distances between samples in the user-specified groups/categories. These are stored in the unweighted_unifrac_boxplots and weighted_unifrac_boxplots folders. This example shows unweighted unifrac distance distributions between various subsites in the Caporaso Yellowstone data:

Each of these distance matrices is also used as the starting point for a principal coordinate analysis. Principal coordinate analysis (PCoA) is essentially a way to map (potentially) non-Euclidean distances into a Euclidean space to enable further investigation. The n distances from the input distance matrix are projected into n-1 dimensions and then principal components analysis (PCA) is performed to reduce the dimensionality of this space back down to something we can comprehend, capturing as much of the variation as possible between the points in the earliest principal components. Note that because the starting point is distance matrix (not a full set of features about each sample, as is usual in PCA), PCoA does not retain any information about the original measured variables and can't be used look for patterns in these variables; it is strictly useful for understanding the similarity or difference between the particular input cases. However, because it is not constrained (like PCA) to looking at correlation/covariance matrices, PCoA can be used with any distance matrix and thus is useful (I'm told) for various fields like microbial ecology where the "distances" of interest are frequently non-Euclidean.

The principal coordinate analysis results are viewable through Emperor, a browser-based viewer. In the unweighted_unifrac_pcoa_plot or weighted_unifrac_emperor_pcoa_plot folder, open the index.html file to see a 3-D graph of all the samples in the first three principal coordinates. The Emperor interface (see more details here: http://biocore.github.io/emperor/ ) allows ways to color these points by various metadata categories, which can reveal correlations between the principal coordinates and the metadata. For example, here is the Caporaso Yellowstone data (Steep Cone only) weighted unifrac PCoA graph, colored by temperature (lighter colors = cooler, darker colors = warmer):


In [ ]: